Lab 3.1 - Bootstrapping

Author

Student

Published

April 23, 2024

Creating a bootstrap simulation

For this assignment, we will be using data from the Washington D.C. bicycle sharing system. In particular, we want to try and create accurate confidence intervals for various features of the dataset.

You can find the variable definitions here.

Expectations

  1. Before starting, write a sentence or two about what you expect the confidence intervals to look like for two variables: cnt (number of bicycles rented that day) and temp.

Setup

  1. Install the library infer (documentation here) and load it

  2. Review the basic code process here.
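A minimal setup sketch for the two steps above. Loading dplyr alongside infer is an assumption here, made because the later snippets in this lab use slice_sample(), group_by(), and summarize().

```r
# Install infer only if it is not already available, then load it.
if (!requireNamespace("infer", quietly = TRUE)) {
  install.packages("infer")
}
library(infer)

# Assumption: dplyr supplies slice_sample(), group_by(), and summarize()
# used later in the lab.
library(dplyr)
```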

Create a sample

  1. Make a subset of the data that is a random sample of size 50 using the following command:
dcbikes_sample <- dcbikes %>% 
  slice_sample(n = 50, replace = TRUE)

Classical confidence intervals

  1. Check the conditions and create the 95% confidence intervals by hand for the two variables

  2. Compare these results to the overall dataset values - did your confidence interval cover the true value? How close was your sample to reality?
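As a rough illustration of the "by hand" calculation, the sketch below builds a t-based 95% interval for a mean. The vector cnt_sample is simulated stand-in data, not the real bike counts; in the lab you would pull the column from your own sample of size 50 instead.

```r
set.seed(1)

# Hypothetical stand-in for the cnt column of your size-50 sample.
cnt_sample <- round(rnorm(50, mean = 4500, sd = 1900))

n      <- length(cnt_sample)
x_bar  <- mean(cnt_sample)            # sample mean
se     <- sd(cnt_sample) / sqrt(n)    # standard error of the mean
t_star <- qt(0.975, df = n - 1)       # critical value for a 95% interval

ci_classical <- c(lower = x_bar - t_star * se,
                  upper = x_bar + t_star * se)
ci_classical
```

The result should match t.test(cnt_sample)$conf.int, which is a convenient way to check the arithmetic.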

Bootstrapping

  1. Use the rep_sample_n(size = 50, replace = TRUE, reps=100000) function to resample with replacement. Make sure to follow the code example on the moderndive webpage to generate your sampling distribution of means.

Note: VERY IMPORTANT - use your sample of size 50, and then resample from THAT sample.

To create the replications, modify the following sample code:

virtual_resamples <- pennies_sample %>% 
  rep_sample_n(size = 50, replace = TRUE, reps = 10000)

To generate a list of sample means, in particular, you’ll need to modify this sample code to correspond to your data:

virtual_resampled_means <- virtual_resamples %>% 
  group_by(replicate) %>% 
  summarize(mean_year = mean(year))

Remember that when bootstrapping, you should make sure that the size argument is set to be the size of your sample - you want to use all of the information possible from your sample!
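One way the two sample snippets might look once adapted is sketched below. The data frame here is a simulated stand-in with a single cnt column (an assumption so the code runs on its own), and reps is kept small for speed; in the lab, dcbikes_sample is the sample you drew earlier.

```r
library(dplyr)
library(infer)

set.seed(2024)

# Hypothetical stand-in for the size-50 sample drawn from dcbikes.
dcbikes_sample <- tibble(cnt = round(rnorm(50, mean = 4500, sd = 1900)))

# Resample from the sample itself, using its full size each time.
virtual_resamples <- dcbikes_sample %>%
  rep_sample_n(size = nrow(dcbikes_sample), replace = TRUE, reps = 1000)

# One mean per bootstrap replicate.
virtual_resampled_means <- virtual_resamples %>%
  group_by(replicate) %>%
  summarize(mean_cnt = mean(cnt))
```

Setting size with nrow(dcbikes_sample) rather than a hard-coded 50 keeps the resample size tied to the sample size if the sample ever changes.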

  1. Create a confidence interval using the quantile() function on this sampling distribution. How do these compare to your confidence intervals calculated in the classical way?

In particular, you want the cutoffs at the 0.025 and 0.975 quantiles of your bootstrap distribution (the range within which the middle 95% of the resampled means fell).

Hint: you can find documentation on using the quantile() function by typing ?quantile in the Console window
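A small sketch of the percentile approach using base R only. The values in sample_values are simulated and purely illustrative; the point is the shape of the quantile() call, which you would apply to your own column of resampled means.

```r
set.seed(42)

# Hypothetical stand-in for one variable in a sample of size 50.
sample_values <- rnorm(50, mean = 20, sd = 8)

# Bootstrap: resample with replacement at the full sample size, keep each mean.
boot_means <- replicate(
  5000,
  mean(sample(sample_values, size = length(sample_values), replace = TRUE))
)

# Percentile interval: the 2.5% and 97.5% cutoffs of the bootstrap means.
ci_percentile <- quantile(boot_means, probs = c(0.025, 0.975))
ci_percentile
```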

  1. Compare the result of this confidence interval you generated by bootstrapping to the ones calculated by classical methods. How close were they? Do the differences surprise you or not?

  2. Think carefully about what the difference is between a confidence interval calculated by classical methods and the one generated by bootstrapping. What are the differences in key assumptions?

If you have extra time, you can try to use the alternative workflow described here.
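The linked page is not reproduced here, so the following is only a guess at what the alternative workflow looks like: infer's specify / generate / calculate pipeline followed by get_confidence_interval(). The data frame is again a simulated stand-in with a cnt column.

```r
library(dplyr)
library(infer)

set.seed(7)

# Hypothetical stand-in for the size-50 sample drawn from dcbikes.
dcbikes_sample <- tibble(cnt = round(rnorm(50, mean = 4500, sd = 1900)))

# Build the bootstrap distribution of the sample mean.
bootstrap_distribution <- dcbikes_sample %>%
  specify(response = cnt) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

# Percentile-based 95% confidence interval.
percentile_ci <- bootstrap_distribution %>%
  get_confidence_interval(level = 0.95, type = "percentile")
percentile_ci
```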

If you still have extra time, you can try to bootstrap regression lines via the method described here.